February 2026 — Draft
Transformer architectures have become the dominant approach for sequence modelling, but their mechanisms for encoding positional information vary widely. The original transformer [1] used fixed sinusoidal encodings added to input embeddings. Subsequent work introduced relative position schemes: T5 [2] uses learned additive biases on attention logits indexed by bucketed relative distance, while Rotary Position Embedding (RoPE) [3] encodes relative position through rotations applied to query and key vectors.
While these schemes have been extensively benchmarked on NLP tasks, less is known about how they affect the geometry of learned internal representations—particularly in continuous-valued time series settings where the signal has known low-dimensional manifold structure. This work uses a synthetic time series with precisely controlled geometric properties to study how positional encoding influences both training dynamics and the intrinsic dimensionality of representations at each transformer layer.
We construct a continuous time series that mimics syllable-structured sequential data (e.g. birdsong). A point traverses one of $C = 10$ circles embedded in $\mathbb{R}^{D}$ with $D = 20$, switching between circles according to a sparse Markov transition matrix with ring connectivity and long-range shortcuts.
Each circle $c \in \{1, \ldots, C\}$ is defined by a 2D plane $\text{span}(\mathbf{u}_c, \mathbf{v}_c)$ in $\mathbb{R}^D$, where $\mathbf{u}_c, \mathbf{v}_c$ are orthonormal vectors. The trajectory on circle $c$ at angle $\theta$ is:
$$\mathbf{x}_c(\theta) = r_c \left( \cos\theta \cdot \mathbf{u}_c + \sin\theta \cdot \mathbf{v}_c \right)$$

where $r_c$ is the radius. Each circle has a fixed angular velocity $\omega_c$, with periods ranging from 40 to 400 time steps (a 10× speed range). Dwell times on each circle average ~400 steps, quantised to whole revolutions so that entry and exit angles are consistent.
The degree of geometric overlap between circles is controlled by constraining the 2D planes to a shared subspace of dimension $d_{\text{sub}} \leq D$. When $d_{\text{sub}} = D = 20$, each circle occupies a nearly orthogonal plane, producing minimal overlap. When $d_{\text{sub}} = 4$, ten 2D planes are forced into a 4D subspace, creating significant trajectory overlap—the model must rely on temporal dynamics rather than instantaneous geometry to distinguish circles.
Isotropic Gaussian noise is added to every observation: $\mathbf{y}_t = \mathbf{x}_t + \boldsymbol{\epsilon}_t$ with $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_D)$. We set $\sigma = 2.83$, yielding SNR $\approx 2.5$. The noise floor for masked MSE loss is $\sigma^2 = 8.0$.
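As a concrete illustration, the circle construction and noise model above can be sketched in a few lines of NumPy (a sketch under our reading of the setup; names such as `make_circle_bases` are illustrative, not from any released code):

```python
import numpy as np

def make_circle_bases(C=10, D=20, d_sub=20, rng=None):
    """Orthonormal pairs (u_c, v_c) spanning C planes inside a shared
    d_sub-dimensional subspace of R^D. d_sub = D gives near-orthogonal
    planes; small d_sub forces the planes to overlap."""
    rng = np.random.default_rng(rng)
    # Orthonormal basis of the shared subspace (D x d_sub).
    S, _ = np.linalg.qr(rng.standard_normal((D, d_sub)))
    bases = []
    for _ in range(C):
        # Random orthonormal pair inside the subspace, lifted back to R^D.
        Q, _ = np.linalg.qr(rng.standard_normal((d_sub, 2)))
        plane = S @ Q                      # (D, 2), orthonormal columns
        bases.append((plane[:, 0], plane[:, 1]))
    return bases

def circle_point(theta, u, v, r):
    """Point on the circle spanned by orthonormal u, v at angle theta."""
    return r * (np.cos(theta) * u + np.sin(theta) * v)

def noisy_observation(x, sigma=2.83, rng=None):
    """y_t = x_t + eps_t with isotropic Gaussian noise."""
    rng = np.random.default_rng(rng)
    return x + rng.normal(0.0, sigma, size=x.shape)
```

Because `S` has orthonormal columns, the lifted planes remain orthonormal, and with $d_{\text{sub}} < 2C$ the ten 2D planes must intersect, producing the trajectory overlap described above.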
Figure 1. Left: A sample window of the 20D time series (heatmap) with Markov state labels (top strip). Right: UMAP of the raw data with $d_{\text{sub}} = 20$ showing 10 well-separated circular manifolds.
We train a BERT-style [4] masked prediction model adapted for continuous time series. The architecture is:
$$\text{Input} \; (B, T, 20) \;\xrightarrow{\text{mask}}\; \text{Linear}(20 \to d) \;\xrightarrow{\text{PE}}\; N \times \text{TransformerEncoder} \;\longrightarrow\; \text{MLP}(d \to 20)$$
where $d = 128$, $N = 7$ layers, 4 attention heads, and FFN dimension 512 with GELU activation.
Masked positions (15% of each window) are replaced with a learnable [MASK] embedding before projection.
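A minimal sketch of this architecture using PyTorch's built-in encoder (all names are illustrative, not the authors' code; the positional-encoding step, which differs per scheme, is marked by a comment):

```python
import torch
import torch.nn as nn

class MaskedTSModel(nn.Module):
    """Sketch of the BERT-style masked-prediction model for 20D series."""
    def __init__(self, d_in=20, d_model=128, n_layers=7, n_heads=4, d_ff=512):
        super().__init__()
        # Learnable [MASK] embedding in input space: replacement happens
        # before the linear projection, per the description above.
        self.mask_token = nn.Parameter(torch.zeros(d_in))
        self.proj = nn.Linear(d_in, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_in))

    def forward(self, x, mask):
        # mask: (B, T) boolean, True at masked positions.
        x = torch.where(mask[..., None], self.mask_token, x)
        h = self.proj(x)
        # Sinusoidal PE would be added to h here; T5 bias and RoPE instead
        # act inside attention, which requires a custom encoder layer.
        h = self.encoder(h)
        return self.head(h)
```

Note that `nn.TransformerEncoderLayer` cannot express the T5 bias or RoPE directly, so those variants replace the stock attention module.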
The loss is MSE computed only on masked positions:

$$\mathcal{L} = \frac{1}{|\mathcal{M}|} \sum_{t \in \mathcal{M}} \left\| \hat{\mathbf{y}}_t - \mathbf{y}_t \right\|^2$$

where $\mathcal{M}$ is the set of masked time indices. Masking uses stochastic contiguous patches with sizes drawn uniformly from $[8, 128]$ time steps.
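The masked loss and the patch masking might look like the following sketch (hypothetical helper names; the exact patch-sampling procedure is not fully specified above, so this is one plausible reading in which patches are drawn until roughly 15% of the window is covered):

```python
import torch

def masked_mse(pred, target, mask):
    """MSE averaged over masked positions only; mask is (B, T) boolean."""
    diff = (pred - target) ** 2          # (B, T, D)
    return diff[mask].mean()

def sample_patch_mask(T, ratio=0.15, min_len=8, max_len=128):
    """Stochastic contiguous patches with lengths uniform in [min_len,
    max_len], accumulated until ~ratio of the window is masked (a sketch)."""
    mask = torch.zeros(T, dtype=torch.bool)
    target = int(ratio * T)
    while int(mask.sum()) < target:
        L = int(torch.randint(min_len, max_len + 1, (1,)))
        start = int(torch.randint(0, max(T - L, 1), (1,)))
        mask[start:start + L] = True
    return mask
```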
We compare three positional encoding strategies:
Sinusoidal (absolute). Following Vaswani et al. [1], fixed sinusoidal encodings are added to the input embeddings:
$$\text{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \quad \text{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

T5 relative bias (learned). Following Raffel et al. [2], a learned per-head bias $b_h(i - j)$ is added to the attention logits before softmax. Relative distances are mapped to $B = 64$ buckets using a logarithmic scheme: exact buckets for small distances, log-spaced buckets for distances up to $d_{\max} = 1024$. No positional information is added to the input embeddings.
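The bucketing for the T5 scheme can be sketched as follows, modelled on the bidirectional bucketing described in Raffel et al. [2] (a sketch, not the reference implementation; the `clamp(min=1)` guards the logarithm at distance zero):

```python
import math
import torch

def relative_position_bucket(rel, num_buckets=64, max_distance=1024):
    """Map signed relative distances to bucket indices (bidirectional):
    exact buckets for small |rel|, log-spaced buckets out to max_distance."""
    num_buckets //= 2                      # half the buckets for each sign
    ret = (rel > 0).long() * num_buckets   # positive distances use upper half
    n = rel.abs()
    max_exact = num_buckets // 2           # first half of each side is exact
    is_small = n < max_exact
    # Log-spaced buckets for larger distances, clipped at num_buckets - 1.
    val_large = max_exact + (
        torch.log(n.float().clamp(min=1) / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).long()
    val_large = torch.minimum(
        val_large, torch.full_like(val_large, num_buckets - 1))
    return ret + torch.where(is_small, n, val_large)
```

The learned bias table then has shape `(num_buckets, n_heads)` and is indexed by these bucket ids before being added to the attention logits.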
RoPE (rotary). Following Su et al. [3], queries and keys are rotated by position-dependent angles before the dot product. For head dimension $d_h$ and position $t$, the rotation frequencies are $\theta_i = 10000^{-2i/d_h}$. The rotation is applied to pairs of dimensions:
$$\text{RoPE}(\mathbf{q}, t)_{2i:2i+1} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
This ensures that $\langle \text{RoPE}(\mathbf{q}, t), \text{RoPE}(\mathbf{k}, s) \rangle$ depends only on $t - s$, encoding relative position directly into the attention geometry.
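A minimal RoPE sketch for a tensor of queries or keys (illustrative names; in practice this is applied per head to both $\mathbf{q}$ and $\mathbf{k}$ before the attention dot product):

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x (..., T, d_h) by
    position-dependent angles theta_i = base^{-2i/d_h}. d_h must be even;
    positions has shape (T,). A minimal sketch of RoPE."""
    d_h = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d_h, 2).float() / d_h)  # theta_i
    angles = positions.float()[:, None] * inv_freq[None, :]      # (T, d_h/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The relative-position property is directly checkable: the dot product of a query rotated at position $t$ with a key rotated at position $s$ is invariant to shifting both positions by the same offset.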
RoPE requires a custom attention module; we implement this using PyTorch's `scaled_dot_product_attention` for flash-attention compatibility.
All models use identical hyperparameters: AdamW optimiser with peak learning rate $3 \times 10^{-4}$, weight decay 0.01, linear warmup for 20 epochs followed by cosine decay over 500 total epochs.
Batch size is 128, sequence length is 1024 time steps with stride 64.
Training uses BF16 mixed precision and `torch.compile` on an NVIDIA RTX 5090 GPU.
The dataset consists of 200,000 time steps with a 90/10 train/validation split.
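The optimiser and schedule above might be set up as follows (a sketch stepped once per epoch; `warmup_cosine` is our illustrative name, not an existing API):

```python
import math
import torch

def warmup_cosine(optimizer, warmup_epochs=20, total_epochs=500):
    """Linear warmup to the peak LR over warmup_epochs, then cosine decay
    to zero by total_epochs. Stepped once per epoch."""
    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with the hyperparameters quoted above.
model = torch.nn.Linear(20, 128)   # stand-in for the full model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = warmup_cosine(opt)
```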
After training, we extract intermediate representations from each transformer layer by running the full dataset through the model without masking. We analyse these representations using two complementary methods:
UMAP visualisation. We use Uniform Manifold Approximation and Projection [5] to produce 2D embeddings of each layer's output, coloured by circle identity. This reveals qualitative structure: when circles separate into distinct clusters, the model has learned to disentangle the latent states.
Levina-Bickel intrinsic dimension estimation. We estimate the intrinsic dimensionality of each layer's representation using the maximum-likelihood estimator of Levina and Bickel [6]. For a point $\mathbf{x}$ and its $k$-th nearest neighbour at distance $R_k(\mathbf{x})$:
$$\hat{m}_k(\mathbf{x}) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{R_k(\mathbf{x})}{R_j(\mathbf{x})} \right]^{-1}$$

The global estimate averages over all points. We report estimates at $k \in \{10, 30, 100\}$ to capture structure at different scales: small $k$ reflects local geometry while large $k$ captures global manifold structure.
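The estimator can be sketched directly from the formula (brute-force neighbour search, adequate for modest sample sizes; names are illustrative):

```python
import numpy as np

def levina_bickel(X, k=30):
    """Levina-Bickel MLE of intrinsic dimension for X of shape (n, D),
    averaged over all points. Brute-force O(n^2) neighbour search."""
    # Pairwise distances; sorting each row ascending puts the self-distance
    # (zero) first, so columns 1..k are the k nearest neighbours.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    R = D[:, 1:k + 1]                    # R_1(x) .. R_k(x)
    # m_hat(x) = [ (1/(k-1)) * sum_{j<k} log(R_k(x) / R_j(x)) ]^{-1}
    logs = np.log(R[:, -1:] / R[:, :-1])
    m_hat = 1.0 / logs.mean(axis=1)
    return float(m_hat.mean())
```

On samples from a known manifold this recovers the expected dimension approximately, e.g. points on a circle give an estimate near 1.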
The three positional encoding schemes exhibit markedly different convergence behaviour, despite identical architectures and hyperparameters.
Figure 2. Training loss curves for sinusoidal (left), T5 relative bias (centre), and RoPE (right). All runs use identical architecture, data, and hyperparameters; only the positional encoding differs.
| Metric | Sinusoidal | T5 | RoPE |
|---|---|---|---|
| Best val MSE | 8.06 | 8.03 | 7.996 |
| Epochs to noise floor | ~200 | ~100 | ~50 |
| Training plateau | Yes (epochs 80–130) | No | No |
Table 1. Comparison of training dynamics across positional encoding schemes.
The sinusoidal model exhibits a plateau between epochs 80 and 130, suggesting it passes through a more entangled intermediate representational state before finding a good solution. T5 eliminates this plateau and converges smoothly. RoPE converges roughly 4× faster than sinusoidal, reaching the noise floor by epoch ~50 with no plateaus or step transitions.
We use the Levina-Bickel estimator at $k = 30$ to track how intrinsic dimensionality evolves through the network. The input data has an estimated dimension of 11.4.
| Layer | Sinusoidal | T5 | RoPE |
|---|---|---|---|
| Input (20D) | 8.0 | 11.4 | 11.4 |
| Layer 1 | 12.4 | 7.4 | 9.0 |
| Layer 2 | 10.7 | 6.8 | 6.5 |
| Layer 3 | 9.6 | 6.1 | 4.5 |
| Layer 4 | 8.7 | 5.4 | 3.6 |
| Layer 5 | 7.2 | 4.7 | 2.9 |
| Layer 6 | 4.4 | 3.7 | 2.3 |
| Layer 7 | 2.6 | 2.5 | 1.9 |
| Output (20D) | 1.2 | 1.6 | 1.6 |
Table 2. Levina-Bickel intrinsic dimension estimates ($k = 30$) at each layer for the three positional encoding schemes.
The sinusoidal model shows a characteristic expansion-then-compression pattern: Layer 1 increases dimensionality from 8.0 to 12.4 (projecting the 20D input into a 128D space that initially entangles the representations), followed by gradual compression through Layers 2–7. The major dimensionality squeeze occurs in the final two layers.
T5 eliminates the Layer 1 expansion entirely (7.4 vs 12.4), producing monotonic compression from the first layer onward. RoPE goes further: although its Layer 1 estimate is slightly higher than T5's (9.0 vs 7.4), it compresses more aggressively through the middle layers. By Layer 4, RoPE has already reached 3.6, a level the sinusoidal model does not reach until Layer 6. The final-layer dimension of 1.9 for RoPE is the closest of the three schemes to the true 1D structure of each circular manifold.
Figure 3. UMAP of intermediate representations for the sinusoidal model. The circles become progressively separated through the layers, but Layer 1 shows a tangled, high-dimensional state.
Figure 4. UMAP of intermediate representations for the T5 model. Circles begin separating earlier (Layer 2–3) with a smoother progression.